

Some of the hypotheses driving our analysis are listed below. GDP is a relatively effective signal of a country’s development; thus: “Developed regions waste a greater percentage of food.” “The greater the quantity of food produced, the greater the percentage of food wasted.” “The greater agriculture’s share of GDP, the greater the amount of food wasted.”

While our analysis is quantitatively-driven, …

We started our analysis by viewing all the countries in an easy-to-read heatmap. Right away, one can see that the United States is a main contributor to worldwide food waste, so we may want to focus our attention on this country. We also know that the US has a high GDP, so other high-GDP countries may follow suit. For the map, we had to change some of the country names in our data sets so that they would match those used by the maps library. For example, United States of America becomes USA. Also, note that some countries have no data, which is explained in our flaws section.
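
The renaming step described above can be sketched in base R. This is only an illustration under stated assumptions: the data frame, its column names, and the loss values are made up, and only the "United States of America" to "USA" mapping comes from the text (the "UK" entry is an assumed second example).

```r
# Hypothetical waste table; values are invented for illustration only.
waste <- data.frame(
  country  = c("United States of America", "Mexico", "United Kingdom"),
  loss_pct = c(12.5, 8.1, 6.3),
  stringsAsFactors = FALSE
)

# Lookup from the data set's spelling to the spelling used by the map data.
# Only the USA entry is taken from the text; "UK" is an assumed example.
name_fix <- c(
  "United States of America" = "USA",
  "United Kingdom"           = "UK"
)

# Replace names found in the lookup and leave all other names untouched.
needs_fix <- waste$country %in% names(name_fix)
waste$region <- waste$country
waste$region[needs_fix] <- unname(name_fix[waste$country[needs_fix]])

# A left join against the map's regions leaves NA for countries with no
# matching row, which is how "no data" areas end up on a heatmap.
map_regions <- data.frame(region = c("USA", "Mexico", "UK", "Canada"),
                          stringsAsFactors = FALSE)
joined <- merge(map_regions, waste, by = "region", all.x = TRUE)
print(joined)
```

Keeping the lookup as a named vector makes it easy to add further mismatched names as they are discovered.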

## Warning: `summarise_each_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `across()` instead.
## ℹ The deprecated feature was likely used in the dplyr package.
##   Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

The US has the most records of “over 40% food loss per commodity per year” to date, roughly twice as many as Mexico. It is striking that some commodities lose more than 40% of their amount during the production and retail process. We want to look deeper into which commodities incur the most loss in the US and whether there is any reason behind it.
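
As a rough sketch of how such a count could be produced, the snippet below filters records above the 40% threshold and tallies them per country. The data frame and all of its values are hypothetical; the real data set's column names may differ.

```r
# Toy records: one row per country, commodity and year (invented values).
records <- data.frame(
  country   = c("USA", "USA", "USA", "Mexico", "Mexico", "Canada"),
  commodity = c("Pineapple juice", "Orange juice", "Spinach",
                "Tomatoes", "Okra", "Spinach"),
  year      = c(2005, 2005, 2006, 2005, 2006, 2006),
  loss_pct  = c(45, 41, 12, 43, 9, 20),
  stringsAsFactors = FALSE
)

# Keep only the records with more than 40% loss, then count them per country.
high_loss <- records[records$loss_pct > 40, ]
counts <- sort(table(high_loss$country), decreasing = TRUE)
print(counts)
```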

The following graph indicates that in the US, pineapple juice, orange juice and grapefruit juice have been among the most wasted commodities over the past decades. One commonality among the most wasted commodities is that they are all juices. We may want to explore where in the production and retail process most of this juice waste occurs.

The following graph explores total food waste per year in the United States. In 2008, the US incurred its highest food waste of the past five decades. A possible explanation for this severe year is that during the recession, a lot of food was wasted because many retailers and food processors went out of business. We would need more data to back up this hypothesis.

The following data further explore the most wasted foods in the US. Besides juice, we observe that canned mushrooms, tomatoes and spinach are also among the most wasted foods in the US.

## # A tibble: 85 × 2
##    commodity        sum_loss_per_year
##    <chr>                        <dbl>
##  1 Pineapple juice              2012.
##  2 Canned mushrooms             1705 
##  3 Orange juice                 1480.
##  4 Grapefruit juice             1460.
##  5 Apple juice                  1307.
##  6 Green garlic                  935.
##  7 Grape juice                   933.
##  8 Tomatoes                      756.
##  9 Spinach                       618.
## 10 Okra                          592.
## # … with 75 more rows
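
A ranking like the one above can be sketched with base R's `aggregate()` and `order()`. The input below is toy data with invented values, and the column name `sum_loss_per_year` is borrowed from the printed output; how the original column was actually computed is not shown in the report.

```r
# Toy US records (invented values); one row per commodity and year.
us <- data.frame(
  commodity = c("Pineapple juice", "Spinach", "Pineapple juice",
                "Tomatoes", "Spinach", "Tomatoes"),
  year      = c(2000, 2000, 2001, 2000, 2001, 2001),
  loss_pct  = c(40, 10, 42, 15, 12, 14),
  stringsAsFactors = FALSE
)

# Sum the loss figures per commodity, then sort from largest to smallest.
totals <- aggregate(loss_pct ~ commodity, data = us, FUN = sum)
names(totals)[2] <- "sum_loss_per_year"
totals <- totals[order(-totals$sum_loss_per_year), ]
print(head(totals, 10), row.names = FALSE)
```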

Our whole analysis revolves around food waste in different countries over time. We can therefore look not only at food waste in a single year, but also at food waste over a span of years. Our questions are built around this concept: Has food waste decreased or increased over time? Is there a relationship between a country’s GDP and its food waste? Are there specific countries that are outliers (wasting much more or less food than everyone else)? What is the projected food waste for a specific region in 2023?

          In building our data analysis, we use our “Aggregate Data” file, which consolidates all the data sets introduced. The main statistical analysis we conducted used regression models, specifically simple and multiple regression.
          These are the packages used in our analysis.
suppressPackageStartupMessages(library(olsrr))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(corrplot))
          The first model that we introduce is a simple regression model that examines the relationship between ‘Loss in tonnes’ and ‘Production in tonnes’. Based on these two variables, there appears to be a positive relationship between ‘Loss’ and ‘Production’. While the p-value (≈ 0.002) indicates statistical significance and, to an extent, supports our initial hypothesis regarding the two variables’ relationship, ‘Production’ alone likely does not account for all ‘Loss’: the model explains only about 17% of the variance (R-squared ≈ 0.171). This can also be inferred from the correlation (≈ 0.413), which indicates only a weak-to-moderate linear relationship between the two variables.
## 
## Call:
## lm(formula = Loss ~ Production, data = AggData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -585087131 -115732941  -28574344  142025080  663246104 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.700e+08  8.753e+07   9.939 1.28e-13 ***
## Production  1.818e-02  5.559e-03   3.270  0.00191 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 208500000 on 52 degrees of freedom
## Multiple R-squared:  0.1706, Adjusted R-squared:  0.1546 
## F-statistic:  10.7 on 1 and 52 DF,  p-value: 0.00191
##                 Loss Production
## Loss       1.0000000  0.4130241
## Production 0.4130241  1.0000000
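
For a regression with a single predictor, the R-squared equals the squared correlation between predictor and response, which is why the two figures above tell the same story. The sketch below checks this on simulated data (not the report's ‘AggData’) and then on the reported correlation.

```r
set.seed(1)
n <- 54                                  # 52 residual df + 2 coefficients
production <- runif(n, 1e9, 3e10)        # simulated values, for illustration
loss <- 8.7e8 + 0.018 * production + rnorm(n, sd = 2e8)
toy <- data.frame(Loss = loss, Production = production)

fit <- lm(Loss ~ Production, data = toy)
r   <- cor(toy$Loss, toy$Production)
r2  <- summary(fit)$r.squared

# These two numbers agree up to floating-point error.
print(c(squared_correlation = r^2, r_squared = r2))

# Applied to the correlation reported above:
reported_r <- 0.4130241
print(round(reported_r^2, 4))            # matches the Multiple R-squared
```
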
          Hence, in this second model, we expanded the original simple model into a multiple regression model that includes the other macro predictors we had originally been interested in: ‘Year’, ‘GDP’, ‘Agriculture as Percentage of GDP’, ‘Percentage of Land used for Agriculture’, ‘Population’, and ‘Production’.
## 
## Call:
## lm(formula = Loss ~ Year + GDP + AgriGDP + AgriLand + Population + 
##     Production, data = AggData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -401785287  -99038421  -10308608  112454861  588290004 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -6.277e+11  2.190e+11  -2.866  0.00625 **
## Year         3.224e+08  1.127e+08   2.860  0.00635 **
## GDP          6.403e-06  9.913e-06   0.646  0.52157   
## AgriGDP     -5.089e+07  7.730e+07  -0.658  0.51358   
## AgriLand     3.005e+08  1.639e+08   1.834  0.07316 . 
## Population  -5.027e+00  1.449e+00  -3.470  0.00114 **
## Production   2.169e-01  1.012e-01   2.143  0.03745 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 182800000 on 46 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4239, Adjusted R-squared:  0.3488 
## F-statistic: 5.642 on 6 and 46 DF,  p-value: 0.0001869

          However, upon expanding the model, the p-values suggest that only three of the six predictors (‘Year’, ‘Population’, and ‘Production’) are statistically significant at the 5% level. To further this analysis, we generated a correlation matrix to examine why some variables might appear “insignificant”. A large portion of the macro predictors are highly correlated. For example, the correlation between ‘GDP’ and ‘Population’ is nearly 1, as is that between ‘Production’ and ‘Year’. This suggests that there is potentially some level of interaction (collinearity) between the predictors that causes some of them to become “insignificant” in the model. This potential interaction between the predictors prompts the next step in our analysis. Furthermore, the multiple regression model is likely not the “best” model for our analysis, as we observe a negative intercept. The negative intercept is hard to interpret, as it suggests “negative” loss when all predictors are set to zero (a point far outside the observed data, for example ‘Year’ = 0). In the next step, using the “olsrr” package, we generated all possible combinations of predictors (including interaction terms) to determine the most appropriate predictors based on the package’s default statistical tests (p-value and AIC).
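
The collinearity concern raised above can be quantified with variance inflation factors (VIF), alongside the correlation matrix. The sketch below is not part of the original analysis: it uses simulated predictors that all trend with time, and computes each VIF by hand as 1 / (1 - R²) from regressing one predictor on the others.

```r
set.seed(42)
n <- 54
year <- seq(1966, by = 1, length.out = n)          # hypothetical year range
X <- data.frame(
  Year       = year,
  Population = 2e8 + 2.5e6 * (year - 1966) + rnorm(n, sd = 1e6),
  GDP        = 1e12 + 3e11 * (year - 1966) + rnorm(n, sd = 2e11),
  AgriLand   = 48 - 0.05 * (year - 1966) + rnorm(n, sd = 0.5)
)

# Pairwise correlations between the simulated predictors.
print(round(cor(X), 2))

# VIF for predictor j: regress it on all other predictors, take 1 / (1 - R^2).
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif, 1))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign that a coefficient's standard error is being inflated by its overlap with other predictors.
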
All Possible Variables
Index  n  Predictors           R-Square   Adj R-Square
    1  1  AgriGDP              0.2347192  0.2197137
    2  1  AgriLand             0.2133760  0.1982486
    3  1  Year                 0.1939272  0.1784258
    4  1  Population           0.1888800  0.1732816
    5  1  Production           0.1705889  0.1546387
    6  1  GDP                  0.1439178  0.1274546
    7  2  Year Population      0.2951072  0.2674643
    8  2  GDP AgriLand         0.2525274  0.2232147
    9  2  AgriLand Production  0.2519245  0.2225882
   10  2  Year AgriLand        0.2475960  0.2180899
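
To make the idea behind the table concrete, here is a minimal base-R sketch of an all-possible-subsets search. It does not reproduce the olsrr output or its internals; the data are simulated and the three predictor names are reused only for readability.

```r
set.seed(7)
n <- 54
toy <- data.frame(
  Year     = seq(1966, by = 1, length.out = n),   # hypothetical year range
  AgriGDP  = runif(n, 1, 10),
  AgriLand = runif(n, 40, 50)
)
toy$Loss <- 1.4e9 - 5e7 * toy$AgriGDP + rnorm(n, sd = 2e8)

predictors <- c("Year", "AgriGDP", "AgriLand")

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one model per subset and collect its fit statistics.
results <- do.call(rbind, lapply(subsets, function(vars) {
  s <- summary(lm(reformulate(vars, response = "Loss"), data = toy))
  data.frame(n = length(vars),
             predictors = paste(vars, collapse = " "),
             r_square = s$r.squared,
             adj_r_square = s$adj.r.squared,
             stringsAsFactors = FALSE)
}))

# Order by subset size, then by R-squared within each size.
results <- results[order(results$n, -results$r_square), ]
print(results, row.names = FALSE)
```

With p predictors there are 2^p - 1 candidate models, so this exhaustive approach is practical only for a small number of predictors.
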
          With the output listing the top 10 candidate models, it appears that ‘AgriGDP’, ‘AgriLand’, and ‘Year’ are the top three single predictors in potentially explaining levels of ‘Loss’. Using this information, we proceeded to construct a second multiple regression model in an attempt to examine their relationship with ‘Loss’.
## 
## Call:
## lm(formula = Loss ~ Year + AgriGDP + AgriLand, data = AggData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -469534446 -122527744  -10308033  109867013  729052731 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.201e+09  1.740e+10  -0.069    0.945
## Year         8.247e+05  6.025e+06   0.137    0.892
## AgriGDP     -3.736e+07  7.739e+07  -0.483    0.631
## AgriLand     2.481e+07  1.591e+08   0.156    0.877
## 
## Residual standard error: 204100000 on 49 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2351, Adjusted R-squared:  0.1883 
## F-statistic:  5.02 on 3 and 49 DF,  p-value: 0.004101
          In the new model, similar problems persist. Firstly, the intercept remains negative, which, as previously discussed, is flawed. Additionally, the new model, constructed from the all-possible-variables analysis, still contains predictors that are statistically insignificant; in fact, none of the three predictors is individually significant. Hence, in our next step we proceeded with a forward stepwise predictor selection process. The package uses the predictors’ p-values to construct a model with statistically significant predictors.
## 
##                                 Selection Summary                                  
## ----------------------------------------------------------------------------------
##         Variable                  Adj.                                                
## Step    Entered     R-Square    R-Square     C(p)         AIC            RMSE         
## ----------------------------------------------------------------------------------
##    1    AgriGDP       0.2347      0.2197    12.1090    2180.4885    200101907.9473    
## ----------------------------------------------------------------------------------
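
The forward selection idea can be sketched in base R with `add1()`, which tests each candidate term against the current model. This is an illustrative stand-in, not the olsrr procedure itself: the data are simulated, and the entry threshold `p_enter = 0.05` is an assumption that may differ from the package's default.

```r
set.seed(7)
n <- 54
toy <- data.frame(
  Year     = seq(1966, by = 1, length.out = n),   # hypothetical year range
  AgriGDP  = runif(n, 1, 10),
  AgriLand = runif(n, 40, 50)
)
toy$Loss <- 1.4e9 - 5e7 * toy$AgriGDP + rnorm(n, sd = 1e8)

predictors <- c("Year", "AgriGDP", "AgriLand")
full_scope <- reformulate(predictors)   # ~ Year + AgriGDP + AgriLand
p_enter    <- 0.05                      # assumed entry threshold

fit <- lm(Loss ~ 1, data = toy)         # start from the intercept-only model
repeat {
  remaining <- setdiff(predictors, attr(terms(fit), "term.labels"))
  if (length(remaining) == 0) break
  # F-test for adding each remaining term to the current model.
  candidates <- add1(fit, scope = full_scope, test = "F")
  candidates <- candidates[remaining, , drop = FALSE]
  best <- which.min(candidates[["Pr(>F)"]])
  if (candidates[["Pr(>F)"]][best] >= p_enter) break
  fit <- update(fit, as.formula(paste(". ~ . +", remaining[best])))
}

selected <- attr(terms(fit), "term.labels")
print(selected)
```

The loop stops as soon as no remaining term clears the threshold, which is how a procedure of this kind can end after entering a single variable.
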
          The conclusion drawn from our variable selection was that ‘AgriGDP’ alone is the most appropriate predictor among all available data in our ‘AggData’ file. Thus, building our final model, we have the simple regression model below:
## 
## Call:
## lm(formula = Loss ~ AgriGDP, data = AggData)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -462030474 -121393937   -9332016  107418209  728487574 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1415466065   73646340  19.220  < 2e-16 ***
## AgriGDP      -49240206   12450043  -3.955 0.000237 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200100000 on 51 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2347, Adjusted R-squared:  0.2197 
## F-statistic: 15.64 on 1 and 51 DF,  p-value: 0.0002369
##               Loss    AgriGDP
## Loss     1.0000000 -0.4844783
## AgriGDP -0.4844783  1.0000000
          The model above demonstrates a negative relationship between ‘Loss’ and ‘Agriculture as a Percentage of GDP’, with correlation ≈ -0.484. This conclusion was initially counter-intuitive, as one would assume that a greater share of agriculture in GDP means more production and consumption. Upon further analysis (refer back to the correlation matrix), we found that ‘GDP’ is negatively correlated with ‘AgriGDP’. Hence, there likely exist factors within the ‘AgriGDP’ data that explain this relationship. For example, ‘AgriGDP’ includes data for both imports and exports of agricultural goods, whereas the ‘Loss’ data reports only production loss that occurred domestically. Hence, if a country imports a greater proportion of its agricultural goods rather than producing them domestically, the net effect could indeed be negative. While the correlation between predictor and response in the new simple regression model is still only moderate, its magnitude is slightly higher than in the first model (0.484 versus 0.413). This, while more promising, still suggests that there exist other factors outside of our consideration that are more appropriate for projecting food production loss.
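
To show how the fitted line would be used for a projection, the sketch below plugs the coefficients from the final model summary into the regression equation. The input value `AgriGDP = 5` is hypothetical, and treating ‘AgriGDP’ as measured in percentage points is an assumption.

```r
# Coefficients copied from the final model summary above.
intercept <- 1415466065
slope     <- -49240206    # change in Loss per one-unit increase in AgriGDP

# Fitted value of Loss for a given AgriGDP.
predict_loss <- function(agri_gdp) intercept + slope * agri_gdp

# AgriGDP = 5 is a hypothetical input, not a value taken from the data.
print(predict_loss(5))                    # 1169265035

# The squared correlation reproduces the reported Multiple R-squared.
print(round((-0.4844783)^2, 4))           # 0.2347
```

With an R-squared of about 0.23, any such projection carries wide uncertainty, which is consistent with the caveat above that other factors are likely needed.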

We believe that our graphs are actually quite easy to glance at. While making these graphs, we wanted to design them in a way that they could possibly be the first iteration of something that can be used in the “interactive” part of this project. For example, the heatmap we made at the beginning quickly helped us understand which country was a “problem” when it came to food waste.
